Structural metadata annotation: moving beyond English
Abstract
The goal of metadata extraction (MDE) is to enable technology that can take raw speech-to-text output and refine it into forms that are more useful to humans and to downstream automatic processes. Starting in 2003, a structural metadata annotation task was defined for English as part of the DARPA EARS Program. A significant new challenge for MDE is the addition of new languages. This paper reports on work undertaken to apply MDE annotation to data from three very different languages: Mandarin Chinese, Levantine Arabic, and conversational Czech. Details of annotation task modifications are provided for each language, along with a general overview of data and annotation tools for non-English MDE.
Similar resources
Czech spontaneous speech corpus with structural metadata
This paper describes a Czech spontaneous speech corpus consisting of radio talk show recordings. As the first complete non-English MDE corpus, it has been annotated with structural metadata information beyond the words that is critical to both increasing transcript readability and allowing application of downstream NLP methods. Metadata annotation involves partitioning verbatim transcripts into...
The Czech Broadcast Conversation Corpus
This paper presents the final version of the Czech Broadcast Conversation Corpus that will shortly be released at the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yields about 33 hours of transcribed conversational speech from 128 speakers. The release does not only include verbatim transcripts and speaker information, but also structu...
Metadata Based Annotation Infrastructure Offers Flexibility and Extensibility for Collaborative Applications and Beyond
In this position paper, we describe three user scenarios that benefit from metadata based annotation infrastructure. We explain how a basic annotation schema can be extended to support new scenarios. We also describe and evaluate some other features and modifications that are useful when implementing these scenarios. The most laborious part in the scenarios is the design and implementation of n...
Gross-grained RST through XML Metadata for Multilingual Document Generation
We present an RST-based discourse annotation proposal used in the construction of a trial multilingual XML-tagged corpus of teaching material in Basque, English and Spanish. The corpus feeds an experimental multilingual document generation system for the web. The main contributions of this paper are an implementation of RST through XML metadata and the adoption of gross-grained RST to avoid non...
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for forei...